Relational Coarsest Partition Problems (RCPPs) play a vital role in verifying concurrent
systems. It is known that RCPPs are P-complete and hence it may not be
possible to design polylog time parallel algorithms for these problems.

In this paper, we present two efficient parallel algorithms for RCPP, in which the
associated labeled transition system is assumed to have $m$ transitions and $n$ states. The
first algorithm runs in $O(n^{1+\epsilon})$ time using $m/n^{\epsilon}$ CREW PRAM processors, for any fixed
$\epsilon < 1$. This algorithm is analogous to, and optimal with respect to, the sequential algorithm
of Kanellakis and Smolka. The second algorithm runs in $O(n \log n)$ time using $(m/n) \log n$
CREW PRAM processors. This algorithm is analogous to, and nearly optimal with respect
to, the sequential algorithm of Paige and Tarjan.


1  Introduction 

Relational Coarsest Partition Problems play an important role in verifying concurrent systems
in the form of equivalence checking. In their pioneering work, Kanellakis and Smolka
[7] present an efficient algorithm for RCPP with multiple relations. Their algorithm has a
run time of $O(mn)$, where $m$ is the total number of transitions and $n$ is the number of states

*This research was supported in part by ONR N00014-89-J-1131, DARPA/NSF CCR90-14621 and ARO
DAAL 03-89-C-0031.
in the RCPP. Subsequently, Paige and Tarjan [10] show that RCPP (with a single relation)
can be solved in $O(m \log n)$ time. Both these algorithms have been used in practice to verify
systems  with  thousands  of  states.  The  goal  of  this  paper  is  to  extend  the  applicability  of 
these  algorithms  with  the  use  of  parallelism. 

In recent work, Zhang and Smolka [11] attempt to parallelize
the classical Kanellakis-Smolka algorithm. However, the main thrust of that work was
practical. In particular, a complexity analysis was not provided and was
not the main concern of that paper. On the other hand, it has been shown that RCPP (even
when there is only a single function) is P-complete [1]. P-complete problems are presumed
to be hard to parallelize efficiently. It is widely believed that there may
not exist polylog time parallel algorithms for any of the P-complete problems that use only
a polynomial number of processors.

Since RCPP has been proven to be P-complete, we restrict our attention to designing
polynomial time algorithms. In this paper we present two parallel algorithms for RCPP: 1)
An algorithm that runs in $O(n^{1+\epsilon})$ time using $m/n^{\epsilon}$ CREW PRAM processors for any fixed
$\epsilon < 1$; the same algorithm runs in time $O(n \log n)$ using $(m/\log n) \log\log n$ CRCW PRAM
processors; and 2) An algorithm that runs in time $O(n \log n)$ using only $(m/n) \log n$ CREW PRAM
processors. The first algorithm is optimal with respect to the Kanellakis-Smolka algorithm. We
say a parallel algorithm that runs in time $T$ using $P$ processors is optimal with respect to
a sequential algorithm with a run time of $S$ if $PT = O(S)$, i.e., the work done by the
parallel algorithm is asymptotically the same as that of the sequential algorithm. The two
parallel algorithms described in this paper are for single relation RCPP. They can, however,
be easily extended to multiple relation RCPP without changing their run-time complexities.

The rest of the paper is organized as follows. In Section 2, we provide some definitions
and useful facts about parallel computation. In Sections 3 and 4, we provide our two
algorithms, respectively. Finally, in Section 5, we provide concluding remarks and list some open
problems.

2  Problem  Definitions 

Definition 1 A labeled transition system (LTS) $M$ is $(Q, Q_0, A, T)$, where $Q$ is a set of
states, $Q_0 \subseteq Q$ is a set of initial states, $A$ is a finite alphabet, and $T \subseteq Q \times A \times Q$ is a
transition relation.
For a given LTS $M = (Q, Q_0, A, T)$, we define functions $T_a, T_a^{-1}$ from $Q$ to $2^Q$ for every
$a \in A$ as follows:

$$T_a(p) = \{ q \mid (p, a, q) \in T \}$$

$$T_a^{-1}(q) = \{ p \mid (p, a, q) \in T \}$$

That is, $T_a(p)$ is the set of next states of $p$ via $a$, and $T_a^{-1}(q)$ is the set of states which can lead to
$q$ via $a$. We extend $T_a$ and $T_a^{-1}$ to functions from $2^Q$ to $2^Q$. That is, for every $a \in A$ and
$S \subseteq Q$,

$$T_a(S) = \bigcup_{p \in S} T_a(p), \qquad T_a^{-1}(S) = \bigcup_{p \in S} T_a^{-1}(p)$$
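The definitions above translate almost directly into code. The following is a minimal Python sketch, assuming an LTS is given as a set of `(p, a, q)` triples; the names `build_maps` and `image` are illustrative and do not come from the paper.

```python
from collections import defaultdict

def build_maps(transitions):
    """Return (succ, pred) with succ[a][p] = T_a(p) and pred[a][q] = T_a^{-1}(q)."""
    succ = defaultdict(lambda: defaultdict(set))
    pred = defaultdict(lambda: defaultdict(set))
    for p, a, q in transitions:
        succ[a][p].add(q)
        pred[a][q].add(p)
    return succ, pred

def image(fn_a, states):
    """Extension to sets: the union of fn_a[p] over all p in `states`."""
    result = set()
    for p in states:
        result |= fn_a.get(p, set())
    return result

# A small hypothetical LTS over the alphabet {"a", "b"}.
T = {(1, "a", 2), (1, "a", 3), (2, "b", 3), (3, "a", 1)}
succ, pred = build_maps(T)
```

Here `image(succ["a"], S)` plays the role of $T_a(S)$ and `image(pred["a"], S)` the role of $T_a^{-1}(S)$.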

Given a set $S$, a partition of $S$ is a set of disjoint sets whose union is equal to $S$. We say
that a partition $\pi' = \{B'_1, \ldots, B'_n\}$ is a refinement of a partition $\pi = \{B_1, \ldots, B_m\}$ if every $B'_i$
is contained in some $B_j$.

We can represent an equivalence relation $\pi \subseteq Q \times Q$ as a partition $\{B_i \mid i \in I\}$ where each
block $B_i$ represents an equivalence class in $\pi$.

For a state $q \in Q$ and a subset $S$ of $Q$, let $[q]_\pi$ denote the block in partition $\pi$ which
includes $q$, and let $[S]_\pi$ denote the set of blocks in partition $\pi$ which include some state in
$S$, that is, $[S]_\pi = \{ [q]_\pi \mid q \in S \}$.

The  notion  of  bisimulation  equivalence  as  defined  by  Milner  in  [9]  is  used. 

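Definition 2 Given a labeled transition system $S = (Q, Q_0, A, T)$, a binary relation $\pi \subseteq
Q \times Q$ is a (strong) bisimulation iff

$$\forall (p_1, p_2) \in \pi.\ \forall a \in A.\ \big( \forall q_1.\,( q_1 \in T_a(p_1) \Rightarrow \exists q_2.\,( q_2 \in T_a(p_2) \wedge (q_1, q_2) \in \pi ) ) \ \wedge$$

$$\forall q_2.\,( q_2 \in T_a(p_2) \Rightarrow \exists q_1.\,( q_1 \in T_a(p_1) \wedge (q_1, q_2) \in \pi ) ) \big).$$

For $p, q \in Q$, $p$ and $q$ are said to be bisimilar, denoted by $p \sim q$, if $(p, q) \in \pi$ for some
bisimulation $\pi \subseteq Q \times Q$.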

Definition 3 Suppose that $M_1 = (Q_1, Q_{01}, A, T_1)$ and $M_2 = (Q_2, Q_{02}, A, T_2)$ are LTSs.
We say that $M_1$ and $M_2$ are bisimilar if for every $p \in Q_{01}$ there exists $q \in Q_{02}$ such
that $p$ and $q$ are bisimilar in the LTS $M = (Q_1 \cup Q_2, Q_{01} \cup Q_{02}, A, T_1 \cup T_2)$, and vice versa.

To  show  whether  or  not  two  states  are  bisimilar,  it  suffices  to  show  that  there  is  a  bisimulation 
relation  that  includes  both  of  them  in  the  same  equivalence  class. 

There are two important problems on LTSs: the bisimulation testing problem and
the greatest bisimulation finding problem. Bisimulation testing, for two given LTSs, is
to decide whether or not they are bisimilar.
The greatest bisimulation of a given labeled transition system is a bisimulation such that
any bisimulation relation in the system is a refinement of it. For a given LTS $M$, finding the
greatest bisimulation is the same as finding the minimum LTS that is bisimilar to $M$.

The state minimization problem, for a given LTS $M = (Q, Q_0, A, T)$, is to find a bisimilar
LTS $M' = (Q', Q'_0, A, T')$ with the smallest possible number of states.

Suppose $\pi$ is the greatest bisimulation of an LTS $M = (Q, Q_0, A, T)$. Then, the minimal
LTS of $M$ is the reduction of $M$ according to the greatest bisimulation $\pi$, that is, $M/\pi =
(\pi, [Q_0]_\pi, A, T_\pi)$, where $T_\pi = \{ ([q]_\pi, a, [q']_\pi) \mid (q, a, q') \in T \}$.

Both of these problems can be solved by an algorithm for the relational coarsest
partitioning problem, which is defined as follows:

Relational  Coarsest  Partitioning  Problem  (RCPP) 

Input: An LTS $M = (Q, Q_0, A, T)$ with a finite state set $Q$, an initial partition $\pi_0$ of the set
$Q$ of states, and relations $T_1, \ldots, T_k \subseteq Q \times Q$.

Output: the coarsest (having the fewest blocks) partition $\pi = \{B_1, \ldots, B_\ell\}$ of $Q$ such that

1. $\pi$ is a refinement of $\pi_0$, and

2. for every $p, q$ in block $B_i$, for every block $B_j$ in $\pi$, and for every relation $T_m$,

$$T_m(p) \cap B_j \neq \emptyset \ \text{ iff } \ T_m(q) \cap B_j \neq \emptyset.$$
That is, either $B_i \subseteq T_m^{-1}(B_j)$ or $B_i \cap T_m^{-1}(B_j) = \emptyset$.
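Condition 2 can be checked directly from its statement. The sketch below is a brute-force Python test of that condition, assuming each relation is given as a dictionary from a state to its set of successors; the function name `is_stable` is illustrative. It is meant to make the definition concrete, not to be efficient.

```python
def is_stable(partition, relations):
    """Check condition 2 of RCPP.

    partition: list of disjoint sets of states.
    relations: list of dicts mapping a state to its set of successors.
    """
    for succ in relations:
        for block in partition:
            for target in partition:
                # For each p in the block: does T_m(p) intersect the target block?
                hits = [bool(succ.get(p, set()) & target) for p in block]
                if any(hits) and not all(hits):
                    return False  # the block is only partly inside T_m^{-1}(target)
    return True

# Hypothetical single relation: 1 -> 2, 2 -> 2, 3 -> 4.
succ = {1: {2}, 2: {2}, 3: {4}}
```

With this relation, the partition `[{1, 3}, {2}, {4}]` is not stable (state 1 reaches `{2}` but state 3 does not), while `[{1, 2}, {3}, {4}]` is.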

2.1  Parallel  Computation  Models 

A  large  number  of  parallel  machine  models  have  been  proposed.  Some  of  the  widely  accepted 
models  are:  1)  fixed  connection  machines,  2)  shared  memory  models,  3)  the  boolean  circuit 
model, and 4) the parallel comparison trees. Of these we’ll focus on 1) and 2) only. The time
complexity of a parallel machine is a function of its input size. Precisely, time complexity is
a function $p(n)$ that is the maximum, over all inputs of size $n$, of the time elapsed from when the
first processor begins execution until the last processor stops execution.

A  fixed  connection  network  is  a  directed  graph  G(V,  E)  whose  nodes  represent  processors 
and  whose  edges  represent  communication  links  between  processors.  Usually  we  assume  that 
the  degree  of  each  node  is  either  a  constant  or  a  slowly  increasing  function  of  the  number  of 
nodes in the graph. Fixed connection networks are supposed to be the most practical models.
The Connection Machine, Intel Hypercube, ILLIAC IV, Butterfly, etc. are examples of fixed
connection machines.

In  shared  memory  models  (also  known  as  PRAMs  for  Parallel  Random  Access  Machines), 
processors  work  synchronously  communicating  with  each  other  with  the  help  of  a  common 
block  of  memory  accessible  by  all.  Each  processor  is  a  random  access  machine.  Every  step 
of  the  algorithm  is  an  arithmetic  operation,  a  comparison,  or  a  memory  access.  Several 
conventions  are  possible  to  resolve  read  or  write  conflicts  that  might  arise  while  accessing 
the shared memory. EREW (Exclusive Read Exclusive Write) PRAM is the shared memory
model where no simultaneous read or write is allowed on any cell of the shared memory.
CREW (Concurrent Read Exclusive Write) PRAM is a variation which permits concurrent
read but not concurrent write. Finally, the CRCW (Concurrent Read Concurrent Write)
PRAM model allows both concurrent read and concurrent write. Write conflicts in the CRCW
model are resolved with a priority scheme.

The parallel run time $T$ of any algorithm for solving a given problem cannot be less than
$S/P$, where $P$ is the number of processors employed and $S$ is the run time of the best known
sequential algorithm for solving the same problem. We say a parallel algorithm is optimal
if it satisfies the equality $PT = O(S)$. The product $PT$ is referred to as the work done by
the parallel algorithm. We say a parallel algorithm that runs in time $T$ using $P$ processors
is optimal with respect to a sequential algorithm with a run time of $S$ if $PT = O(S)$, i.e.,
the work done by the parallel algorithm is asymptotically the same as that of the sequential
algorithm.
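As a worked instance of this definition, the products $PT$ of the two algorithms in this paper can be compared with the sequential bounds $O(mn)$ (Kanellakis and Smolka) and $O(m \log n)$ (Paige and Tarjan). The processor counts used below, $m/n^{\epsilon}$, $(m/\log n)\log\log n$ and $(m/n)\log n$, are taken as the values consistent with the optimality claims made for the two algorithms.

```latex
\begin{align*}
\text{Algorithm I (CREW):}\quad  PT &= \frac{m}{n^{\epsilon}} \cdot O\!\left(n^{1+\epsilon}\right) = O(mn), \\
\text{Algorithm I (CRCW):}\quad  PT &= \frac{m \log\log n}{\log n} \cdot O(n \log n) = O(mn \log\log n), \\
\text{Algorithm II (CREW):}\quad PT &= \frac{m}{n} \log n \cdot O(n \log n) = O\!\left(m \log^{2} n\right).
\end{align*}
```

The first product matches $O(mn)$ exactly, which is why Algorithm I is called optimal; the last exceeds $O(m \log n)$ by one logarithmic factor, which is why Algorithm II is called nearly optimal.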

The  model  assumed  in  this  paper  is  the  PRAM.  Though  a  PRAM  is  supposed  to  be 
impractical,  it  is  easy  to  design  algorithms  on  this  model  and  usually  algorithms  developed 
for this model can be easily mapped onto more practical models. Also, there is a simulation
algorithm that will map any PRAM algorithm into an algorithm for the hypercube network
(such as Ncube, Intel Hypercube, Connection Machines) with at most a logarithmic
factor of slowdown [8]. Thus, all the time bounds mentioned in this paper will apply to the
above machines if multiplied by a logarithmic factor.

2.2  Some  Useful  Facts 

In  this  section,  we  state  some  well-known  results  which  are  used  to  analyze  algorithms 
presented  in  this  paper. 
Lemma 1 [3] If $W$ is the total number of operations performed by all the processors using
a parallel algorithm in time $T$, we can simulate this algorithm using $P$ processors such that
the new algorithm runs in time $\lfloor W/P \rfloor + T$.
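The bound of Lemma 1 can be illustrated numerically. The sketch below assumes the parallel algorithm is described only by the number of operations issued in each of its $T$ steps (the list `ops_per_step` is hypothetical data), and replays each step on $P$ processors in $\lceil w/P \rceil$ rounds.

```python
def simulate_with_p_processors(ops_per_step, p):
    """Time needed to replay a parallel schedule on p processors, step by step."""
    # A step that issues w operations needs ceil(w / p) rounds on p processors.
    return sum(-(-w // p) for w in ops_per_step)

ops = [8, 4, 2, 1]         # operations issued in each of T = 4 parallel steps
W, T = sum(ops), len(ops)  # total work W = 15
```

Since each step costs at most $\lfloor w/P \rfloor + 1$ rounds, the total never exceeds $\lfloor W/P \rfloor + T$; for example, with `p = 3` the replay takes 7 rounds against a bound of 9.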

As  a  consequence  of  the  above  Lemma  we  can  also  get: 

Lemma 2 If a problem can be solved in time $T$ using $P$ processors, we can solve the same
problem using $P'$ processors (for any $P' < P$) in time $O(PT/P')$.

Given a sequence of numbers $k_1, k_2, \ldots, k_n$, the problem of prefix sums computation is to
output the numbers $k_1, k_1 + k_2, \ldots, k_1 + k_2 + \cdots + k_n$. The following Lemma is folklore [5]:

Lemma 3 Prefix sums of a sequence of $n$ numbers can be computed in $O(\log n)$ time using
$n/\log n$ EREW PRAM processors.
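One common way to reach the processor bound of Lemma 3 is a three-phase blocked scan. The Python sketch below simulates it sequentially, assuming each of roughly $n/\log n$ virtual processors owns a chunk of about $\log n$ consecutive elements; this is a standard textbook scheme shown for illustration and is not necessarily the construction of [5].

```python
import math

def blocked_prefix_sums(values):
    """Sequential simulation of a three-phase blocked prefix-sum computation."""
    n = len(values)
    if n == 0:
        return []
    block = max(1, math.ceil(math.log2(n)))  # chunk length, about log n
    chunks = [values[i:i + block] for i in range(0, n, block)]
    # Phase 1: every virtual processor scans its own chunk (O(log n) time).
    local = []
    for chunk in chunks:
        running, out = 0, []
        for x in chunk:
            running += x
            out.append(running)
        local.append(out)
    # Phase 2: prefix sums over the chunk totals (a balanced tree on a PRAM).
    offsets, running = [], 0
    for out in local:
        offsets.append(running)
        running += out[-1]
    # Phase 3: every processor adds its offset to its chunk (O(log n) time).
    return [x + off for out, off in zip(local, offsets) for x in out]
```

For the input `[3, 1, 4, 1, 5, 9, 2, 6]` the chunks have length 3 and the result is `[3, 4, 8, 9, 14, 23, 25, 31]`.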

The following Lemma is due to Cole [4]:

Lemma 4 Sorting of $n$ numbers can be done in $O(\log n)$ time using $n$ EREW PRAM
processors.

The following Lemma concerns the problem of sorting numbers from a small
universe:

Lemma 5 [2] $n$ numbers in the range $[0, n^c]$ can be sorted in $O(\log n)$ time using $(n/\log n) \log\log n$
CRCW PRAM processors, as long as $c$ is a constant. The same problem can be solved in
$O(n^{\epsilon})$ time for any fixed $\epsilon < 1$, using $n/n^{\epsilon}$ CREW PRAM processors.

3  Algorithm  I 

In this section, we present a parallel algorithm for RCPP with a single relation. This
algorithm runs in time $O(n^{1+\epsilon})$ using $m/n^{\epsilon}$ CREW PRAM processors, for any fixed $\epsilon < 1$. The same
algorithm runs in $O(n \log n)$ time on a CRCW PRAM using $(m/\log n) \log\log n$ processors. Since
our algorithm is analogous to the Kanellakis-Smolka algorithm, we present their algorithm
in Figure 1 for the case of a single relation before we describe ours.

Each run of the for loop of Kanellakis-Smolka’s Algorithm takes $O(m)$ time and this loop
can be executed at most $n$ times. Thus, the run time of this algorithm is $O(mn)$.

Figure 2 describes our parallel algorithm, which is based on Kanellakis-Smolka’s
Algorithm. We first explain the definitions and data structures used in our algorithm.
$\pi := \pi_0$

/* Initially $\pi'$ is empty */

while $\pi' \neq \pi$ do
    $\pi' := \pi$

    for every $B$ in $\pi'$ do

        select a state $p$ from $B$

        $B_1 := \emptyset$  /* $B_1 = \{ q \in B \mid [T(q)]_{\pi'} = [T(p)]_{\pi'} \}$ */

        $B_2 := \emptyset$  /* $B_2 = \{ q \in B \mid [T(q)]_{\pi'} \neq [T(p)]_{\pi'} \}$ */
        for every $q$ in $B$ do

            if $[T(q)]_{\pi'} = [T(p)]_{\pi'}$ then
                add $q$ into $B_1$
            else add $q$ into $B_2$
        end for

        if $B_1$ and $B_2$ are not empty then  /* $B$ is split */

            $\pi := (\pi - \{B\}) \cup \{B_1, B_2\}$

    end for
end while


Figure 1: Kanellakis-Smolka’s Algorithm for the Single RCPP
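For readers who want to experiment with Figure 1, the following is a minimal runnable Python sketch of the same refinement idea for a single relation. It assumes the relation is a dictionary `succ` from a state to its set of successors, uses the set of block ids reached by a state as its signature $[T(q)]_{\pi'}$, and picks the smallest state of each block as the representative $p$; these choices and the name `naive_refine` are illustrative, not taken from [7].

```python
def naive_refine(succ, initial_partition):
    """Return the coarsest stable refinement of initial_partition under succ."""
    partition = [set(b) for b in initial_partition]
    changed = True
    while changed:
        changed = False
        # Block ids of the partition at the start of this pass (pi' in Figure 1).
        block_of = {q: i for i, b in enumerate(partition) for q in b}
        new_partition = []
        for block in partition:
            p = min(block)  # any representative state works
            sig_p = frozenset(block_of[r] for r in succ.get(p, ()))
            same = {q for q in block
                    if frozenset(block_of[r] for r in succ.get(q, ())) == sig_p}
            rest = block - same
            new_partition.append(same)          # B_1
            if rest:
                new_partition.append(rest)      # B_2, so the block is split
                changed = True
        partition = new_partition
    return partition
```

For instance, with `succ = {1: {3}, 2: {3}, 3: set()}` and the single initial block `{1, 2, 3}`, the result is the two blocks `{1, 2}` and `{3}`.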


PARTITION    (1,5)  (1,7)  (2,1)  (2,3)  (3,2)  (3,4)  (3,6)
TRANSITIONS  (1,2)  (1,4)  (2,4)  (2,5)  (2,7)  (3,2)  ...
B            2      3      2      3      1      3      1
TSIZE        2      3      3      3      4      3      1

Table 1: Contents of Data Structures: An Example
Definitions and Data Structures. Let $T(p)$ stand for $\{ q \in Q \mid (p, q) \in T \}$, i.e., $T(p)$ is
the set of states to which there is a transition from $p$. Similarly define $T^{-1}(p)$.

The current partition is represented as an array PARTITION. It is an array of size $n$
with (block id, state) pairs. For example, a pair $(i, q)$ represents that the state $q$ currently
belongs to the $i$th block. We maintain that the states are stored in the array PARTITION
such that states belonging to the same block appear consecutively.

The array TRANSITIONS is used to store the $T$ relation of an LTS. In particular, the
array is of size $m$ and each entry contains the (from-state id, to-state id) pair. In the array
TRANSITIONS, we store the transitions of $T(1)$, followed by the transitions of $T(2)$, and
so on. TSIZE is an array of size $n$ such that TSIZE[$q$] stands for $|T(q)|$ for each $q$ in $Q$.
Note that the arrays TRANSITIONS and TSIZE are never altered during the algorithm.

We also maintain an array $B$ such that for each state $p$ in $Q$, $B[p]$ is the id of the block in
the current partition $\pi$ to which $p$ belongs. In addition, for each state $p \in Q$, we let $[p]$ stand
for the set $\{ B[q] \mid q \in T(p) \}$. We emphasize here that no repetition of elements is permitted
in $[p]$. For any state $q$ in $Q$, we let $[T(q)]$ stand for the sequence $B[p_1], B[p_2], \ldots, B[p_t]$,
where $T(q) = \{p_1, p_2, \ldots, p_t\}$. Notice that $[T(q)]$ can have multiple occurrences of the same
element.

As an example to illustrate our data structures, consider the following initial partition, $\pi_0$:
$\{\{5,7\}, \{1,3\}, \{2,4,6\}\}$. Let the transition relation $T$ be defined as follows: $T(1) = \{2,4\}$;
$T(2) = \{4,5,7\}$; $T(3) = \{2,6,7\}$; $T(4) = \{1,5,6\}$; $T(5) = \{1,2,6,7\}$; $T(6) = \{2,4,6\}$;
$T(7) = \{1\}$.

Table 1 shows the contents of PARTITION, TRANSITIONS, B, and TSIZE at the
beginning.
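The initial contents of the four arrays for this example can be derived mechanically from $\pi_0$ and $T$. The short Python sketch below does so, assuming blocks are numbered 1, 2, 3 in the order in which they are listed in $\pi_0$; dictionaries stand in for the 1-indexed arrays B and TSIZE.

```python
# The running example: successor lists T(p) and the initial partition pi_0.
T = {1: [2, 4], 2: [4, 5, 7], 3: [2, 6, 7], 4: [1, 5, 6],
     5: [1, 2, 6, 7], 6: [2, 4, 6], 7: [1]}
pi0 = [[5, 7], [1, 3], [2, 4, 6]]

# (block id, state) pairs; states of the same block are stored consecutively.
PARTITION = [(i, q) for i, block in enumerate(pi0, start=1) for q in block]
# (from-state, to-state) pairs: transitions of T(1), then T(2), and so on.
TRANSITIONS = [(p, q) for p in sorted(T) for q in T[p]]
# B[q] is the id of the block containing q; TSIZE[q] is |T(q)|.
B = {q: i for i, q in PARTITION}
TSIZE = {q: len(T[q]) for q in T}
```

Reading `B` and `TSIZE` in state order 1 to 7 gives 2, 3, 2, 3, 1, 3, 1 and 2, 3, 3, 3, 4, 3, 1 respectively.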

Assume that there is a processor associated with each transition and each state of the
LTS. At the beginning, PARTITION has tuples corresponding to the initial partition. The
array TRANSITIONS never gets modified in the algorithm. Array B is also initialized
appropriately. For any state $q$, the processors associated with $T(q)$ know the position of
state $q$ in the array PARTITION.

The algorithm repeats as long as there are possibilities of splitting at least one of the
blocks in the current partition and is described in Figure 2. Given that $\pi = \{B_1, B_2, \ldots, B_\ell\}$,
$B_i = \{q_{i,1}, q_{i,2}, \ldots, q_{i,n_i}\}$, and $T(q_{i,j}) = \{p_{i,j,1}, p_{i,j,2}, \ldots, p_{i,j,m_{i,j}}\}$, Steps 1-3 are to construct
the sequence $L$:

$$L_{1,1}, L_{1,2}, \ldots, L_{1,n_1}, L_{2,1}, \ldots, L_{2,n_2}, \ldots, L_{\ell,1}, \ldots, L_{\ell,n_\ell}$$

where $L_{i,j}$ is a sequence of triples:

$$(i, j, B[p_{i,j,1}]),\ (i, j, B[p_{i,j,2}]),\ \ldots,\ (i, j, B[p_{i,j,m_{i,j}}]).$$

Steps 4-6 are to eliminate duplicates in $L$ and compress the array $L$. At the end of Step 6,
the array $L$ contains $[p]$ for every state $p$ in each block in the current partition. Furthermore,
for each block $B = \{p_1, \ldots, p_k\}$, $[p_1], [p_2], \ldots, [p_k]$ appear consecutively in $L$.

Step 7 identifies blocks that can be split. Note that even if there is a single $j$ such that
$[q_{i,j}] \neq [q_{i,1}]$, we may end up splitting the block $B_i$ and thus the block $B_i$ is marked.

Step 8 picks one of the marked blocks arbitrarily and splits it. If the block $B_i$ is chosen,
then $B_i$ is split into $B_i$ and $B_{\ell+1}$, where $B_{\ell+1} = \{ p \in B_i \mid [p] \neq [q_{i,1}] \}$ and $B_i$ is updated to
be $B_i - B_{\ell+1}$. After the splitting, we update PARTITION such that states belonging to
the same block appear consecutively. Note that we could have split all those blocks that are
marked instead of just one such block as done in Step 8; even then, the worst case run-time
of the algorithm would be the same.

Analysis. We assume that there are $n + m$ processors, one for each state and one for each
transition.

Step 1 takes $O(1)$ time using $n$ processors. Steps 3, 5, and 7 also take $O(1)$ time but need
$m$ processors. In Step 2, prefix computation can be done using $n/\log n$ processors in $O(\log n)$
time (see Lemma 3). In Step 4, we need to sort $m$ numbers in the range $[0, n^3]$, and hence
we can apply Lemma 5 to infer that it can be done in $O(\log m) = O(\log n)$ time using
$(m/\log n) \log\log n$ processors, or in $O(n^{\epsilon})$ time using $m/n^{\epsilon}$ processors for any fixed $\epsilon < 1$. Step 6 takes
$O(\log m) = O(\log n)$ time using $m/\log n$ processors (see Lemma 3). In Step 8, prefix computation
takes $O(\log n)$ time using $n/\log n$ processors and the rest of the computation can be completed
in $O(1)$ time using $n$ processors.

Thus, each run of the while loop can be completed in either: 1) $O(\log n)$ time with a
total work of $O(m \log\log n)$, or 2) $O(n^{\epsilon})$ time with a total work of $O(m)$. Since the while loop
can be executed at most $n$ times, we get by applying Lemma 1:

Theorem 1 RCPP with $m$ transitions and $n$ states can be solved 1) in $O(n \log n)$ time
using $(m/\log n) \log\log n$ CRCW PRAM processors, or 2) in $O(n^{1+\epsilon})$ time using $m/n^{\epsilon}$ CREW PRAM
processors, for any fixed $\epsilon < 1$.
$\pi := \pi_0$; split := true
while split do

    split := false; let $\pi = \{B_1, B_2, \ldots, B_\ell\}$

    Unmark $B_1, B_2, \ldots, B_\ell$

    1. for $i := 1$ to $n$ in parallel do

           TEMP[$i$] := TSIZE[PARTITION[$i$].state]

    2. Compute the prefix sums of TEMP[1], TEMP[2], ..., TEMP[$n$]

       Let the sums be $v_1, v_2, \ldots, v_n$

    3. for $i := 1$ to $n$ in parallel do

           $s_i$ := PARTITION[$i$].state;

           Let $T[s_i]$ be $\{q_1, \ldots, q_k\}$
           for $j := 1$ to $k$ in parallel do

               Let processor in-charge of transition $(s_i, q_j)$ write $(i, j, B[q_j])$ in $L[v_{i-1} + j]$

    4. Sort the sequence $L$ in lexicographic order.

    5. for $i := 1$ to $m$ in parallel do if $L[i] = L[i+1]$ then $L[i] := 0$

    6. Compress the list $L$ using a prefix computation

    7. for each block $B_i$ ($1 \leq i \leq \ell$) in parallel do

           for each $j$, $2 \leq j \leq n_i$, in parallel do
               if $[q_{i,j}] \neq [q_{i,1}]$ then mark $B_i$

    8. if there is at least one marked block then

           split := true; $\ell := \ell + 1$

           Pick one of the marked blocks (say $B_i$) arbitrarily
           for each $p$ in $B_i$ do

               if $[p] \neq [q_{i,1}]$ then

                   $B[p] := \ell + 1$

                   Change the corresponding entry in PARTITION to $(p, \ell + 1)$

           /* $B_{\ell+1} := B_i - \{ p \in B_i : [p] = [q_{i,1}] \}$ and $B_i := B_i - B_{\ell+1}$ */

           Using a prefix computation, modify PARTITION such that all tuples
           corresponding to the same block are in successive positions.

           When the array PARTITION is modified, positions of some
           states $q$ might change; inform the processors associated with
           the corresponding $T(q)$'s of this change. $\pi$ has been thus modified.


Figure  2:  Algorithm  I 
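To make the data flow of Figure 2 easier to follow, here is a sequential Python sketch of one pass of the while loop. It is an illustration under stated assumptions rather than the parallel algorithm itself: each triple is tagged as (block id, state id, successor block id), the first marked block is the one chosen in Step 8, PARTITION is rebuilt by sorting instead of a prefix computation, and the name `refine_once` is hypothetical.

```python
def refine_once(partition_arr, B, T):
    """One pass of the while loop of Algorithm I, simulated sequentially.

    partition_arr: list of (block id, state); states of a block are contiguous.
    B: dict state -> block id.  T: dict state -> list of successors.
    Returns True if a block was split (partition_arr and B are updated in place).
    """
    # Steps 1-3: one triple per transition, tagged with its source block and state.
    L = [(blk, q, B[r]) for blk, q in partition_arr for r in T.get(q, [])]
    # Steps 4-6: sort lexicographically, then drop adjacent duplicates.
    L.sort()
    L = [t for k, t in enumerate(L) if k == 0 or t != L[k - 1]]
    # [q] for every state: the duplicate-free, sorted list of successor block ids.
    sig = {q: [] for _, q in partition_arr}
    for _, q, b in L:
        sig[q].append(b)
    # Step 7: mark each block holding a state whose [q] differs from its first state's.
    first, marked = {}, []
    for blk, q in partition_arr:
        first.setdefault(blk, q)
        if sig[q] != sig[first[blk]] and blk not in marked:
            marked.append(blk)
    if not marked:
        return False
    # Step 8: split one marked block; the differing states get the new id l + 1.
    chosen = marked[0]
    new_id = max(B.values()) + 1
    for blk, q in partition_arr:
        if blk == chosen and sig[q] != sig[first[chosen]]:
            B[q] = new_id
    # Restore the invariant that states of a block are stored consecutively.
    partition_arr[:] = sorted((B[q], q) for _, q in partition_arr)
    return True
```

Calling `refine_once` repeatedly until it returns `False` mirrors the outer while loop; since each call splits at most one block, at most $n$ calls are needed.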
4 Algorithm II

In this section, we present a parallel version of Paige and Tarjan’s algorithm [10]. This
algorithm has a run time of $O(n \log n)$ using $(m/n) \log n$ CREW PRAM processors. Thus, this
algorithm is nearly optimal with respect to the algorithm of [10]. We first give a brief description of Paige
and Tarjan’s algorithm, followed by the parallel algorithm.

4.1  The  Sequential  Algorithm 

Paige and Tarjan [10] present an efficient algorithm for the “relational” partitioning problem
of an “unlabeled” transition system $S = (Q, Q_0, A, T)$, where $T \subseteq Q \times Q$. That is, there
is only one kind of relation. Without loss of generality, they assume $|T(p)| \geq 1$ for all
$p \in Q$. The reason is that a given initial partition $\pi_0$ can be refined into $\pi_1 \cup \pi_2$, where
$\pi_1 = \{ B' \neq \emptyset \mid B' = B \cap T^{-1}(Q) \}$ and $\pi_2 = \{ B' \neq \emptyset \mid B' = B - T^{-1}(Q) \}$, with $B$ ranging over the blocks of $\pi_0$. Since the blocks in
$\pi_2$ will not be split, it suffices to consider only $\pi_1$.

For $S \subseteq Q$ and a partition $\pi$, we define $split(S, \pi)$ to be a new partition $\pi'$ such that
each block $D$ in $\pi$ is replaced by $D \cap T^{-1}(S)$ and $D - T^{-1}(S)$. If either of them is the empty
set, then it is not included in $\pi'$. The resulting $\pi'$ has the following properties: 1) $\pi'$ is a
refinement of $\pi$, and 2) $\pi'$ consists of the largest blocks that are stable with respect to $S$.
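The operation $split(S, \pi)$ can be written down in a few lines. The following Python sketch assumes the inverse relation is available as a dictionary `pred` from a state to its set of predecessors; the helper name `preimage` is illustrative.

```python
def preimage(pred, S):
    """T^{-1}(S): all states that have at least one successor in S."""
    result = set()
    for q in S:
        result |= pred.get(q, set())
    return result

def split(S, partition, pred):
    """Replace each block D by D & T^{-1}(S) and D - T^{-1}(S), dropping empty sets."""
    pre = preimage(pred, S)
    refined = []
    for D in partition:
        inside, outside = D & pre, D - pre
        refined.extend(part for part in (inside, outside) if part)
    return refined

# Hypothetical relation with transitions 1 -> 2 and 3 -> 4.
pred = {2: {1}, 4: {3}}
```

With this relation, `split({2}, [{1, 3}, {2, 4}], pred)` separates state 1 (which reaches `{2}`) from state 3 and leaves `{2, 4}` intact.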

The major idea behind Paige and Tarjan’s algorithm is to show that $split(S - B, split(B, \pi))$
can be computed in $O(|T^{-1}(B')|)$ time, where $B'$ is the smaller of $B$ and $S - B$. This idea,
called the process-smaller-half strategy, is as follows: Suppose $S$ is a union of some blocks
of $\pi$ such that $\pi$ is stable with respect to $S$, and $B \subseteq S$ is in $\pi$. For every block $D \in \pi$, if
$D \cap T^{-1}(S) = \emptyset$, then $D$ is stable with respect to $B$ and $S - B$; otherwise, $D$ can be split
by $split(S - B, split(B, \pi))$ into three blocks, $D_1, D_2, D_3$, as follows:

1. $D_1 = D - T^{-1}(S - B)$: the successors of $D_1$ are only in $B$ since $\pi$ is stable with respect
to $S$.

2. $D_2 = D \cap T^{-1}(B) \cap T^{-1}(S - B)$: the successors of $D_2$ are in both $B$ and $S - B$.

3. $D_3 = D - T^{-1}(B)$: the successors of $D_3$ are only in $S - B$.

For $p \in Q$ and a subset $S$ of $Q$, let $count(p, S)$ be the number of the next states of $p$ in
$S$, that is, $count(p, S) = |S \cap T(p)|$. Assuming that we have already computed $count(p, S)$
and $count(p, B)$ for all $p \in D$, we can decide to which of the three blocks $D_1, D_2, D_3$ the state $p$
belongs as follows:

1. $p$ is in $D_1$ if $count(p, B) = count(p, S)$, i.e., there are transitions from $p$ to $B$ but not to
$S - B$.

2. $p$ is in $D_2$ if $0 < count(p, B) < count(p, S)$, i.e., there are transitions from $p$ to both $B$
and $S - B$.

3. $p$ is in $D_3$ if $count(p, B) = 0$, i.e., there are transitions from $p$ to $S - B$ but not to $B$.
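The three cases above can be sketched as a small Python function. It assumes `succ` maps a state to its set of successors, that $B \subseteq S$, and that every state of $D$ has at least one successor in $S$ (as holds when $\pi$ is stable with respect to $S$ and $D \cap T^{-1}(S) \neq \emptyset$). The counts are recomputed from scratch here for clarity, whereas the algorithm maintains them incrementally; the name `three_way_split` is illustrative.

```python
def three_way_split(D, B, S, succ):
    """Split block D against splitter B, where B is a block contained in the root S."""
    def count(p, X):
        # count(p, X) = |X & T(p)|
        return len(succ.get(p, set()) & X)

    D1, D2, D3 = set(), set(), set()
    for p in D:
        c_b, c_s = count(p, B), count(p, S)
        if c_s == 0:
            raise ValueError("every state of D must have a successor in S")
        if c_b == c_s:
            D1.add(p)   # transitions into B but not into S - B
        elif c_b > 0:
            D2.add(p)   # transitions into both B and S - B
        else:
            D3.add(p)   # transitions into S - B but not into B
    return D1, D2, D3

# Hypothetical data: S = {10, 11, 12} with splitter block B = {10}.
succ = {1: {10}, 2: {10, 11}, 3: {12}}
```

With this data, `three_way_split({1, 2, 3}, {10}, {10, 11, 12}, succ)` places states 1, 2 and 3 in $D_1$, $D_2$ and $D_3$ respectively.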

The algorithm uses a set $X$ of splitters. An element in $X$ is a tree of height 0 or 1 with
the following properties: each leaf is a block in the current partition; the root is the union
of its children blocks; and the current partition is stable with respect to the root. There are
six major steps in the algorithm:

In Step 0, $X$ is initialized with one tree whose children are blocks in the initial partition
$\pi_0$. Throughout the algorithm, $count(p, B)$ is maintained for each state $p$ in $Q$ and for each
block $B$ that is a root in $X$ such that $p \in T^{-1}(B)$. Step 1 selects an arbitrary block $B$ that is
going to be used to split the current partition $\pi$. Step 2 computes $count(p, B)$ for $p \in T^{-1}(B)$
since $B$ will become a new tree root in $X$. Step 3 carries out the three way splitting described
above. Step 4 updates $count(p, S)$ for $p \in T^{-1}(B)$ since $B$ has been eliminated from the tree
rooted at $S$ in $X$. For each block $D$ that has been split into $D_1, D_2, D_3$, Step 5 updates $\pi$
to include the new blocks $D_1, D_2, D_3$ and also inserts them into $X$ as potential splitters.

Analysis.  For  timing  analysis,  the  algorithm  uses  the  following  data  structures:  for  each 
block,  the  algorithm  keeps  the  size  and  maintains  its  member  states  as  a  doubly  linked 
list.  In  addition,  each  block  itself  is  a  member  of  a  doubly  linked  list.  For  each  state,  the 
algorithm  maintains  a  pointer  to  a  block  in  which  it  is  a  member. 

Step 0 takes $O(m)$ time. Step 1 can be completed in constant time. Steps 2-4 take
$O(|T^{-1}(B)|)$ time, where $T^{-1}(B) = \bigcup_{p \in B} T^{-1}(p)$. Step 5 also takes $O(|T^{-1}(B)|)$ time since
there are at most $O(|T^{-1}(B)|)$ marked blocks.

Each time state $p$ is in a chosen splitter, it takes $|T^{-1}(p)|$ time to process it (in Steps 2-5).
Since each state can be in at most $\log n$ splitters due to the process-smaller-half strategy, the
total time incurred due to any state $p$ is at most $|T^{-1}(p)| \log n$. Therefore, the total time is

$$\sum_{p \in Q} |T^{-1}(p)| \log n = \Big( \sum_{p \in Q} |T^{-1}(p)| \Big) \log n = m \log n.$$

The running time of the algorithm is $O(m \log n)$ and the space used is $O(n + m)$, where
$n = |Q|$ and $m = |T|$.
4.2 The Parallel Algorithm

Figures  4,  5  and  6  describe  how  to  implement  each  of  the  above  steps  in  parallel.  The  basic 
steps  are  the  same  as  those  of  the  sequential  algorithm.  There  are,  however,  some  intricate 
details  in  the  definitions  of  and  operations  on  the  data  structures  used  in  our  algorithm.  X 
is  the  collection  of  splitters.  Each  entry  in  X  is  a  tree  of  height  one  or  zero.  A  tree  of  height 
one  is  called  a  compound  splitter,  whereas  a  tree  of  height  zero  is  called  a  simple  splitter.  All 
the  simple  splitters  as  well  as  leaves  of  compound  splitters  are  blocks  in  the  current  partition. 
Moreover,  the  current  partition  is  stable  with  respect  to  each  root  in  X  (including  simple 
splitters). We do not maintain the current partition as a separate data structure, since the
current partition can be readily derived from $X$.

Data  Structures.  We  employ  the  following  data  structures: 

• ITRANSITIONS is an array of size $m$. This array is a sequence of records, one record
per transition. Each record contains a pair $(x, y)$ which corresponds to a transition
from state $x$ to state $y$, and a number equal to $count(x, S)$, where $S$ is the root
in $X$ that $y$ belongs to. These records are ordered according to the second component of
the $(x, y)$ transition. That is, transitions to state 1 are placed before transitions to state
2, etc.

• ITSIZE is an array of size $n$. This array contains $|T^{-1}(p)|$ for each $p$ in $Q$.

• B is an array of size $n$. For each state $p$ in $Q$, $B[p]$ has a pointer to a node in $X$ that
$p$ belongs to.

• XARRAY is an array of size $O(n)$. This array of records maintains both compound and
simple splitters. Each compound splitter has the structure shown in Figure 3. Children
of a compound splitter are represented as a doubly linked list. Each element in this list
is a block in the current partition which is represented as a doubly linked list of states.
For instance, a tree with three children blocks, say A, B, and C, is shown in Figure 3.
Blocks A, B, and C themselves are doubly linked lists of states in the corresponding
blocks. The root of the tree is not represented as a separate node in XARRAY.

Compound splitters themselves are doubly linked. This linked structure is useful in the
following sense: When the current splitting block is removed from its splitter tree, if
this compound splitter tree has two children, then it is likely that this splitter becomes
simple. If this happens, the splitter tree is linked to the list of simple splitters. In the
next phase of the algorithm, we choose the next compound splitter following the link
structure of compound splitters. A similar structure is adopted for simple splitters as
well (see Figure 3).

The crucial fact about XARRAY is that we maintain this linked structure in the form
of an array of records. Each record has four pointers and a state id. These four pointers
are used to represent a tree structure, as shown in Figure 3.

This enables us to delete elements from these lists efficiently. In addition,
we can retrieve all the elements of any list efficiently in parallel. For instance, in
Paige and Tarjan's algorithm, one of the basic steps to be performed is the selection
of a splitter block B.

One important aspect of XARRAY is that blocks always occupy mutually disjoint segments
of XARRAY. Each segment consists of contiguous array elements of XARRAY.
However, the number of array elements used to represent a block can be larger than
the size of the block. That is, there can be some array elements which contain no state
records for the block. Such elements are called "cavities."
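
The flat tables described above can be sketched as follows. This is a minimal sequential illustration under stated assumptions, not the authors' implementation: the function name `build_tables`, the 0-based state numbering, the tuple layout of a record, and the use of an integer block id in place of a pointer are all choices made for the example. It also assumes the initial situation in which the only splitter S is the whole state set Q, so that count(x, S) is simply the number of transitions leaving x.

```python
from collections import Counter

def build_tables(n, transitions):
    """Build ITRANSITIONS, ITSIZE and B for states 0..n-1.

    transitions: list of distinct (x, y) pairs, meaning x -> y.
    Assumes a single initial block (id 0) and a single splitter S = Q.
    """
    # count(x, Q): number of transitions leaving x (every target lies in Q).
    out_degree = Counter(x for x, _ in transitions)
    # One record per transition, ordered by the second component y.
    itransitions = sorted(
        ((x, y, out_degree[x]) for x, y in transitions),
        key=lambda rec: rec[1],
    )
    # ITSIZE[p] = |T^{-1}(p)|: number of transitions entering p.
    itsize = [0] * n
    for _, y in transitions:
        itsize[y] += 1
    # B[p]: id of the block that currently contains p.
    block_of = [0] * n
    return itransitions, itsize, block_of

itransitions, itsize, block_of = build_tables(
    4, [(0, 1), (2, 1), (1, 3), (0, 3), (3, 0)]
)
```

Because the records are sorted by target state, all transitions into a given state occupy one contiguous slice, which is what allows $T^{-1}(B)$ to be gathered with prefix sums over ITSIZE.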

Algorithm. The functionality of each step in the parallel algorithm is the same as that
of the corresponding step in the sequential algorithm. In each phase of Paige and Tarjan's
algorithm, we pick a splitter block B that is a child of a compound splitter and perform
three-way splitting of blocks, possibly including B. Deleting an element from any block can
be done in $O(1)$ time. But then, this might create "cavities" in the array XARRAY. When
we have to retrieve any block B, we have to look at all the elements of XARRAY that B used to
occupy before any cavities were created in B. Thus it seems that some unnecessary work
may be done in retrieving B. On the other hand, parallel retrieval of B amounts to just
one prefix computation. Whenever a block (say D) is split into (say) $D_1$, $D_2$, $D_3$, we delete
elements from D (those that belong to $D_1$ and $D_2$) and store these new blocks as lists by extending
the array (i.e., we store them starting from the first unoccupied position in the array). The
remaining list of D will be called $D_3$.

As a result, even though $D_3$ may only have a few states, this list is stored in the space
that D used to occupy before. In order to retrieve $D_3$ at a later stage, we will have to search
the whole segment of the array that D once occupied. Note that we will have to retrieve a
block only if it has been chosen as the splitter block. In our analysis, we will account for all
the unnecessary retrieval work performed.
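
Retrieving a block from a segment that contains cavities can be sketched with one prefix-sum pass. The code below is a sequential simulation under assumptions made for the example: a cavity is modelled as `None`, and the name `retrieve_block` is hypothetical. On a PRAM, the loop over positions would be executed by one processor per array element once the prefix sums are available.

```python
from itertools import accumulate

def retrieve_block(segment):
    """Return the live records of a segment, in order, skipping cavities.

    segment: list whose entries are state ids, or None for a cavity.
    """
    flags = [0 if slot is None else 1 for slot in segment]
    ranks = list(accumulate(flags))      # inclusive prefix sums of the flags
    size = ranks[-1] if ranks else 0
    block = [None] * size
    for i, slot in enumerate(segment):   # conceptually: one processor per i
        if flags[i]:
            # The rank of a live record is its position in the packed block.
            block[ranks[i] - 1] = slot
    return block

packed = retrieve_block([None, 7, None, None, 2, 9, None])
```

The packed list doubles as the compressed form of the segment: after the call, the block needs only `len(packed)` array elements.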

Analysis. In Step 0, sorting can be done in time $O(\log m) = O(\log n)$ with a total work of
$O(m \log n)$ (see Lemma 4). The rest of the operations here can be completed in $O(\log n)$ time
with a total work of $O(n)$. Step 1 can be performed in $O(1)$ time with a single processor.
Step 2(a) is nontrivial to analyze and is discussed below in detail. In Step 2(b), the dominating
operation is sorting, which takes $O(|T^{-1}(B)| \log n)$ work and $O(\log n)$ time. The rest of the
tasks, such as prefix computations, can be completed within the same time and $O(|T^{-1}(B)| +
|B|)$ total work. Step 3(a) can be done in $O(1)$ time and $O(|T^{-1}(B)|)$ work. In Step 3(b) as well,
sorting takes the longest time to complete. The total work done is $O(|T^{-1}(B)| \log n)$ and
the time needed is $O(\log n)$. Steps 4 and 5 can be performed in $O(1)$ time and $O(|T^{-1}(B)|)$
work.

Notice that any state of Q can appear as a member of some splitting block at most $\log n$
times. Taking into account Steps 1 through 5 (except 2(a)), whenever a state p appears
as a member of a splitting block, we spend a total work of $O(|T^{-1}(p)| \log n)$ for this state.

Step 0:
    π := π_0
    Let S_Q be a splitter whose leaves are all the blocks in π_0.
    /* Note that the union of the leaves of S_Q = Q */
    X := {S_Q}
    /* Initially all blocks in π are unmarked */
    For every (x, y) in ITRANSITIONS create (y, x) and sort these tuples.
    Using these sorted tuples
    for every p in Q in parallel do
        count(p, Q) := |{q ∈ Q | q ∈ T(p)}|
    end for

while X not empty do
Step 1:
    Select and remove any S from X.
    Let B_1 and B_2 be the first two blocks in S.
    if |B_1| < |B_2| then B := B_1 else B := B_2
    Remove B from S and make it a simple splitter.
    if S includes more than two blocks
        then put S back in its previous place,
        else make it a simple splitter.

Step 2(a):
    Let v be the size of the portion of XARRAY used to store B.
    Using v processors, perform a prefix operation to retrieve B
    in parallel and put the elements of B in TEMP'.
    At the same time, compress the portion that B occupies.

Figure 4: Algorithm II
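
Steps 0 and 1 can be sketched sequentially as follows. This is an illustration under assumptions, not the paper's code: the names `initial_counts` and `select_splitter` are hypothetical, a splitter is modelled as a plain list of blocks, and the sort-then-group pass stands in for the parallel sort and prefix operations.

```python
from itertools import groupby

def initial_counts(n, transitions):
    """Step 0: count(p, Q) for every state p, via sorting and run lengths."""
    counts = [0] * n
    # Sorting the (x, y) pairs groups all transitions with the same source.
    for source, run in groupby(sorted(transitions), key=lambda t: t[0]):
        counts[source] = sum(1 for _ in run)
    return counts

def select_splitter(splitter):
    """Step 1: take the smaller of the first two blocks of a compound splitter.

    splitter: list of at least two blocks, each a list of states.
    Returns (B, remaining blocks, whether the remainder is still compound).
    """
    b1, b2 = splitter[0], splitter[1]
    if len(b1) < len(b2):
        chosen, rest = b1, splitter[1:]
    else:
        chosen, rest = b2, [splitter[0]] + splitter[2:]
    # With two or more blocks left the splitter stays compound.
    return chosen, rest, len(rest) >= 2

counts = initial_counts(4, [(0, 1), (2, 1), (1, 3), (0, 3), (3, 0)])
chosen, rest, still_compound = select_splitter([[1, 2, 3], [4], [5, 6]])
```

Choosing the smaller of two blocks guarantees that B holds at most half of the states of S, which is the property behind the bound of $\log n$ appearances per state used in the analysis.
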

Step 2(b):
    Let TEMP'' be the list of ITSIZE[q] for the elements q in TEMP'.
    |B| processors perform a prefix sums operation on the sequence TEMP''.
    Using the above prefix sums and |T^{-1}(B)| processors,
    retrieve T^{-1}(B) and store the elements in TEMP.
    Sort the list TEMP.
    Using this sorted order and two prefix operations,
    for every p in TEMP in parallel do
        count(p, B) := |{q ∈ B | q ∈ T(p)}|
    end for

Step 3(a):
    for every p in TEMP in parallel do
        Let D be the block including p in π.
        Mark D.
        if count(p, B) = count(p, S) then
            label p as type I    /* p belongs to D_1 = D ∩ T^{-1}(B) − T^{-1}(S − B) */
        else    /* 0 < count(p, B) < count(p, S) */
            label p as type II   /* p belongs to D_2 = D ∩ T^{-1}(B) ∩ T^{-1}(S − B) */
    end for

Step 3(b):
    |TEMP| processors perform a prefix sums operation and
    pick elements of TEMP that are of type I or II.
    Let the new list be TEMP'.
    for every p in TEMP' in parallel do
        Create a tuple (B[p], POS_p) and add the tuple to TUPLES,
        where POS_p is the position of element p in XARRAY
    Sort TUPLES in lexicographic order    /* These elements have to be deleted */
    Append the newly created blocks to XARRAY
    Delete these elements from their respective blocks in XARRAY;
    The sorted array TUPLES helps in setting the pointers correctly
    in cases where more than one successive node has to be
    deleted in parallel from any list

Figure 5: Algorithm II (continued)
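
Steps 2(b) and 3(a) can be sketched as follows. This is a sequential simulation under assumptions made for the example: `preimage_counts` and `label` are hypothetical names, `preds[q]` stands for the contiguous slice of ITRANSITIONS holding the sources of transitions into q, and counts are kept in dictionaries rather than inside the transition records.

```python
from itertools import accumulate

def preimage_counts(block, itsize, preds):
    """Step 2(b): gather T^{-1}(B) into TEMP and compute count(p, B)."""
    sizes = [itsize[q] for q in block]           # TEMP''
    offsets = [0] + list(accumulate(sizes))      # prefix sums give write offsets
    temp = [None] * offsets[-1]                  # TEMP, one slot per transition
    for i, q in enumerate(block):                # conceptually parallel
        for j, p in enumerate(preds[q]):
            temp[offsets[i] + j] = p
    temp.sort()
    count_b = {}
    for p in temp:                               # run lengths in the sorted list
        count_b[p] = count_b.get(p, 0) + 1
    return temp, count_b

def label(count_b, count_s):
    """Step 3(a): type I if every transition from p into S lands in B."""
    return {p: ("I" if c == count_s[p] else "II") for p, c in count_b.items()}

# Transitions 0->1, 2->1, 1->3, 0->3, 3->0, with S = Q and B = [1].
preds = [[3], [0, 2], [], [1, 0]]
itsize = [1, 2, 0, 2]
temp, count_b = preimage_counts([1], itsize, preds)
labels = label(count_b, {0: 2, 1: 1, 2: 1, 3: 1})
```

In this example state 0 also has a transition into S − B (to state 3), so it is labelled type II, while state 2 reaches S only through B and is labelled type I.
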

Step 4:
    for every p in TEMP in parallel do
        count(p, S) := count(p, S) − count(p, B)
    end for

Step 5:
    for every marked D in π in parallel do
        Unmark D
        P := {D_i ≠ ∅ | i = 1, 2, 3}
        if |P| = 1 then    /* D is not split */
            retain the original position of D
        else
            if D is a child of a compound splitter S' in X then
                Make D_1 and D_2 (if nonempty) leaves of S'.
            else
                Make D a compound splitter with leaves that are
                all the blocks D_i in P.
    end for
end while

Figure 6: Algorithm II (continued)
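
Steps 4 and 5 can be sketched for a single marked block as follows. The function names `update_counts` and `three_way_split` are hypothetical, the labels are assumed to come from Step 3(a) as a dictionary mapping each state of $T^{-1}(B)$ to "I" or "II", and the bookkeeping that attaches the new blocks to a splitter tree is left out.

```python
def update_counts(count_s, count_b):
    """Step 4: S shrinks to S - B, so subtract the transitions into B."""
    for p, c in count_b.items():
        count_s[p] -= c

def three_way_split(d_states, labels):
    """Step 5: split a marked block D into D_1, D_2 and D_3.

    States absent from labels have no transition into B and stay in D_3.
    Returns the three parts and whether D was actually split.
    """
    d1 = [p for p in d_states if labels.get(p) == "I"]
    d2 = [p for p in d_states if labels.get(p) == "II"]
    d3 = [p for p in d_states if p not in labels]
    nonempty = [part for part in (d1, d2, d3) if part]
    return d1, d2, d3, len(nonempty) > 1

count_s = {0: 2, 2: 1}
update_counts(count_s, {0: 1, 2: 1})
d1, d2, d3, was_split = three_way_split([0, 1, 2, 3], {0: "II", 2: "I"})
```

Only $D_1$ and $D_2$ are moved to fresh array positions; $D_3$ keeps the segment of D, which is where cavities come from.
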

For every splitting block B, we also spend a total work of $O(|B|)$. Therefore, summing
over the whole algorithm, the total work done in Steps 0 through 5 (excepting Step 2(a)) is
$\sum_{p \in Q} |T^{-1}(p)| \log^2 n$, which simplifies to $O(m \log^2 n)$.
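
The summation can be spelled out as follows. This is a sketch that writes $c$ for the hidden constant and assumes $m \ge n$ (both are assumptions of the sketch, not stated in the text). It uses the fact that every transition enters exactly one state, so the preimage sizes sum to $m$, and that each state belongs to at most $\log n$ splitter blocks.

```latex
\begin{align*}
W &\le \sum_{p \in Q} \log n \cdot c\,|T^{-1}(p)| \log n
     \;+\; \sum_{\text{splitter blocks } B} c\,|B| \\
  &\le c \log^2 n \sum_{p \in Q} |T^{-1}(p)| \;+\; c\, n \log n \\
  &= c\, m \log^2 n + c\, n \log n \;=\; O(m \log^2 n).
\end{align*}
```

The $O(m \log n)$ work of Step 0 is absorbed by the same bound.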

We now analyze Step 2(a). Assume that we never compress XARRAY. How large can
the array grow? Observe that whenever we split any block D, we might create new blocks
and hence use new space in XARRAY. Clearly, an upper bound for the new space used in
any run of the while loop is $|T^{-1}(B)|$, implying that we will never have to extend XARRAY
to more than $m \log n$ in length.

How much total time is spent in Step 2(a)? Notice that whenever we have to retrieve
a block (say B) as a splitter block, we have to search through the whole space (including
cavities) that B is stored in. If each such region is searched no more than once, then the total
work done in Step 2(a) is clearly $O(m \log n)$. But the same region may have to be searched
again and again. However, whenever we retrieve a block B, we compress it immediately.
We thus charge an additional work of $O(|B|)$ for compression whenever a splitter block is
retrieved (this accounts for the next time that this region may have to be searched).

Therefore, the total work needed for repeated searches of the regions is no more than
$O(|B|)$ summed over all the splitter blocks used in the whole algorithm, which is $O(n \log n)$.
In sum, the total work needed for processing Step 2(a) in the whole algorithm is $O(m \log n)$.

The total time spent in each run of the while loop is $O(\log n)$, but there could be at most
n runs of the while loop. Thus, the total run time is $O(n \log n)$.

An  application  of  Lemma  1  yields  the  following  theorem. 

Theorem 2 Single relation RCPP can be solved in $O(n \log n)$ time using $\frac{m}{n} \log n$ CREW
PRAM processors.
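
The processor count follows from dividing the total work by the running time. The sketch below assumes that Lemma 1 (stated earlier in the paper, outside this section) is a Brent-style scheduling principle, under which an algorithm with total work $W$ and parallel time $T$ can be run in $O(T)$ time on $W/T$ processors; the symbols $W$, $T$, and $P$ are introduced here for illustration.

```latex
\[
W = O(m \log^2 n), \qquad T = O(n \log n), \qquad
P = \frac{W}{T} = \frac{m \log^2 n}{n \log n} = \frac{m}{n} \log n .
\]
```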

Memory Management. As stated, the above algorithm seems to use $O(m \log n)$ memory
to maintain XARRAY. But we can easily reduce the memory needed for XARRAY to
$O(n)$ as follows: whenever the memory needed to store XARRAY exceeds $2n$ records, we
perform a compression of the array so that at the end of the compression, XARRAY will be of
size n records. The amount of work done for one compression is $O(n)$. Such a compression is
done at most $\frac{m}{n} \log n$ times in the algorithm, since the array can only grow up to $m \log n$ in
size. Therefore, the total work done for compression is $O(m \log n)$.
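
The threshold rule can be sketched as follows. This is a minimal illustration under assumptions: the name `append_with_compaction` is hypothetical, cavities are modelled as `None`, and the remapping of pointers (the B array and the list links) that a real implementation would need after records move is omitted.

```python
def append_with_compaction(xarray, new_records, n):
    """Append new block records, then compact if more than 2n slots are used.

    xarray: list of state ids with None for cavities; modified in place.
    Returns True if a compaction took place.
    """
    xarray.extend(new_records)
    if len(xarray) <= 2 * n:
        return False
    # Order-preserving packing: each block stays in one contiguous segment.
    xarray[:] = [rec for rec in xarray if rec is not None]
    return True

xarray = [None, 1, None, None, 2, None]
compacted = append_with_compaction(xarray, [0], 3)
```

Since each of the n states has exactly one live record, the array holds n records after a compaction, so at least n further appends are needed before the threshold is crossed again.
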
5 Conclusions


We have presented two parallel algorithms for RCPP. An interesting open problem is to
design optimal versions of these algorithms. The bottleneck in these algorithms is the use
of sorting. Another important open problem is to design algorithms with better run
times. Since RCPP is known to be P-complete, a reasonable time to aim for is $O(n^{\epsilon})$,
for any fixed $\epsilon < 1$.

References 

[1] C. Alvarez, J.L. Balcazar, J. Gabarro, and M. Santha, Parallel Complexity in the Design
and Analysis of Concurrent Systems, Springer-Verlag LNCS 505, 1991.

[2] P.C.P. Bhatt, K. Diks, T. Hagerup, V.C. Prasad, T. Radzik, and S. Saxena, Improved
Deterministic Parallel Integer Sorting, Information and Computation 94, 1991, pp. 29-47.

[3]  R.P.  Brent,  The  Parallel  Evaluation  of  General  Arithmetic  Expressions,  Journal  of  the 
ACM  21(2),  1974,  pp.  201-208. 

[4]  R.  Cole,  Parallel  Merge  Sort,  SIAM  Journal  on  Computing  17,  1988,  pp.  770-785. 

[5] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Publications, 1992.

[6] P.C. Kanellakis, S.A. Smolka, CCS Expressions, Finite State Processes, and Three Problems
of Equivalence, Proc. 2nd Annual ACM Symposium on Principles of Distributed
Computing, 1983, pp. 228-240.

[7]  P.C.  Kanellakis,  S.A.  Smolka,  CCS  Expressions,  Finite  State  Processes,  and  Three 
Problems  of  Equivalence,  Information  and  Computation  86,  1990,  pp.  43-68. 

[8] T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays-Trees-Hypercubes,
Morgan-Kaufmann Publishers, San Mateo, California, 1992.

[9] R. Milner, Communication and Concurrency, Prentice-Hall Publishers, 1989.

[10]  R.  Paige  and  R.E.  Tarjan,  Three  Partition  Refinement  Algorithms,  SIAM  Journal  on 
Computing,  16(6),  1987,  pp.  973-989. 




[11]  S.  Zhang  and  S.A.  Smolka,  Towards  Efficient  Parallelization  of  Equivalence  Checking 
Algorithms,  Manuscript,  1993. 




